MDSuite: comprehensive post-processing tool for particle simulations

Particle-Based (PB) simulations, including Molecular Dynamics (MD), provide access to system observables that are not easily available experimentally. However, in most cases, PB data needs to be processed after a simulation to extract these observables. One of the main challenges in post-processing PB simulations is managing the large amounts of data typically generated without incurring memory or computational capacity limitations. In this work, we introduce the post-processing tool: MDSuite. This software, developed in Python, combines state-of-the-art computing technologies such as TensorFlow, with modern data management tools such as HDF5 and SQL for a fast, scalable, and accurate PB data processing engine. This package, built around the principles of FAIR data, provides a memory safe, parallelized, and GPU accelerated environment for the analysis of particle simulations. The software currently offers 17 calculators for the computation of properties including diffusion coefficients, thermal conductivity, viscosity, radial distribution functions, coordination numbers, and more. Further, the object-oriented framework allows for the rapid implementation of new calculators or file-readers for different simulation software. The Python front-end provides a familiar interface for many users in the scientific community and a mild learning curve for the inexperienced. Future developments will include the introduction of more analysis associated with ab-initio methods, colloidal/macroscopic particle methods, and extension to experimental data. Supplementary Information The online version contains supplementary material available at 10.1186/s13321-023-00687-y.


Molecule Mapping
Molecule mapping is a crucial component to many simulation studies and the implementation of it in MDSuite offers a range of potential applications. In this section the process of molecule mapping is discussed in more detail as well as the isomorphism checks used to further ensure that the correct molecules have been detected.

Molecule interface and detection
The molecele mapping in MDSuite is dictated by the information the user provides the molecule graph module. This is interfaced through the mdsuite.Molecule class which contains information about the structure of the molecue, as well some system information such as how many should be in each configuration. An example of this class is shown below: 1 import mdsuite as mds Alternatively, for custom molecules where SMILES strings cannot be built, users can use a dictionary reference such as: 1 import mdsuite as mds At this stage in MDSuite, there is no difference between the use of SMILES and the direct use of a species dictionary. In the future, with the addition of even stricter isomorphism test, there will be some distinction in how exact the molecule mapping can be performed without a reference graph.
The actual mapping of the molecules takes place in the following steps: 1 Build distance tensor between all particles of the species contained in one molecule.
2 Apply mask built from cutoff value to construct adjacency matrix for all particles.
3 Perform graph decomposition on full adjacency matrix to collect matrices of individual molecules.
By the end of this stage, what is returned is a list of the different molecule on which certain isomorphism test can be performed to further ensure their validity.

Isomorphism tests
In order to try and validate that the chosen molecules are correct, MDSuite performs small isomorphism tests and raises an error if they fail. The tests performed are listed below along with why they are chosen and what a user can do in the event that they fail.
1 Molecule amount test: In this test, MDSuite simply checks whether the total number of molecules matches that of the reference information. This is the first and most simple test performed. Typically, a failure here is as a result of an incorrect cutoff. If the cutoff is too large, not enough molecules are found, if it is too small, too many molecules will be found.
2 Group equality test: The second, and at this stage final, test performed by MDSuite is that of Group equality. When MDSuite constructs molecules it does so by building reference data structure with information about which particle species are in each molecule as well as how many. For example, for a water molecule it would state that there should exists one oxygen and two hydrogen particles in each molecule. This test checks to see whether this is true for all of the computed molecules. If this test fails, particles are either not being found, or are so close at this point in the simulation that the module cannot separate them. Resolving this failure is usually achieved by adjusting the cutoff value provided, or by using an alternative reference configuration in which the molecules are more separated.

Algorithms
Throughout the main article reference has been made to several algorithms employed by MDSuite for several calculations. In this section, we expand on why some of these approaches are used as well as where they can be found in the code.

Savitsky Golav Filter
In several types of analysis it is necessary to determine the minimum of a curve within a specified range. To perform such an analysis, it is important the curve being studied is smoothed, or at least, has a smoothed variant that can be used as a reference point for the true data. This point is addressed in MDSuite by using a Savitzky-Golay, or SavGol filter [1], to produce a smoothed copy of a function before a peak-finding code is applied. The filter works by expanding each data point out in a set of low-degree polynomial functions as where M j are convolution coefficients taken from an order n polynomial. In MD-Suite, these parameters can be set by the user for the repsective application. The main benefit of SavGol filter compared to alternatives, such a moving average file, is that the location of the maxima and minima remains unaltered. In MDSuite, the Scipy implementation of the SavGol filter is called within a separately defined method to apply the filter operation. [2].

Golden Section Search
Another algorithm implemented in associate with minimum detection is the Golden-Sector search algorithm developed by Kiefer in 1952 [3]. The algorithm works as most extremum search methods do, by selecting points iteratively and reducing the search range until the value is found. In order to optimise the rate at which the algorithm works, successive intervals are chosen such that the new test range satisfies the golden ration. By providing a suitable range, the Golden-Sector search algorithm converges onto a function minimum with excellent accuracy. In order to ascertain an error value, the MDSuite implementation returns a range within which the minimum exists.

Min-Finding
The Golden-Section search algorithms is not enough on its own to ascertain the minimum value of a function as it requires and defined range over which to look for an extremum. Rather, in MDSuite, we combine the filtering capacity of the SavGol filter with the min-finding capabilities of the Golden-Section search. First, the SavGol filter is applied to the raw signal in order to generate a smoothed copy of the function. On this copy, the typical scipy peak-finding algorithm is applied which simply uses neighbour comparisons to detect the maximums of the function.
These peaks are then parsed to the Golden-Section search algorithm as an initial range within which the minimum exists.

Memory Management
Memory management is a fundamental concept at the heart of MDSuite. The ability to perform memory safe analysis enables MDSuite calculators to examine the largest systems on almost any available device. Beyond providing a foundation for the analysis of large data-sets, memory safety also provides the means for pushing expensive calculations onto performance devices such as GPUs or, theoretically, TPUs.
MDSuite achieves memory safety by computing the scale functions of our calculators. This is to say, each calculator and transformation has with it an associated scaling function which describes how the function scales with respect to additional particles or time-steps. At the beginning of each calculation, the theoretical memory consumption of a computation is derived and a batch size computed accordingly. If there exists no batch size capable of running the analysis, MDSuite will then move to so-called atom-wise or particle-wise batching. In this case, rather than trying to load in as many configurations as possible with all of the particles, this memory requirements are computed using a subset of the available particles. Therefore, the minimum memory requirement of any device is a single particles worth of the desired data range in the case of a dynamics calculation.